Bayesian ML Exam — Formula Cheatsheet
Organized by exercise type number. Every formula you need, nothing you don't.
Type 1: Bayes' Rule & Posterior Computation
| Formula | Expression |
|---|---|
| Bayes' Rule | $p(\theta \mid x) = \dfrac{p(x \mid \theta) \cdot p(\theta)}{p(x)}$ |
| Proportional form | $p(\theta \mid x) \propto p(x \mid \theta) \cdot p(\theta)$ |
| Evidence | $p(x \mid m) = \int p(x \mid \theta, m) \cdot p(\theta \mid m) \, d\theta$ |
| Joint | $p(x, \theta) = p(x \mid \theta) \cdot p(\theta)$ |
Beta Function Trick (for integrals over $\theta \in [0,1]$)
$\displaystyle\int_0^1 \theta^p (1-\theta)^q \, d\theta = B(p+1,\, q+1) = \dfrac{\Gamma(p+1)\,\Gamma(q+1)}{\Gamma(p+q+2)}$
How to use: multiply likelihood × prior, read off the powers $p$ and $q$, plug them into the formula above, and multiply by any constant that sits outside the integral.
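A quick numerical sanity check of the trick (a minimal sketch assuming SciPy is available; the powers $p=3$, $q=2$ are arbitrary example values):

```python
# Numerical check of the Beta function trick with example powers p=3, q=2.
from scipy.special import gamma
from scipy.integrate import quad

p, q = 3, 2  # powers read off from likelihood x prior

closed_form = gamma(p + 1) * gamma(q + 1) / gamma(p + q + 2)
numeric, _ = quad(lambda t: t**p * (1 - t)**q, 0.0, 1.0)

print(closed_form, numeric)  # both ~0.01667 = 1/60
```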
Type 2: Beta-Bernoulli Coin Toss
| Formula | Expression |
|---|---|
| Bernoulli likelihood | $p(D \mid \mu) = \mu^{N_1}(1-\mu)^{N_0}$ |
| Beta prior | $p(\mu) = \text{Beta}(\mu \mid \alpha, \beta) = \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1}$ |
| Posterior | $p(\mu \mid D) = \text{Beta}(\mu \mid \alpha + N_1,\; \beta + N_0)$ |
| Evidence | $p(D) = \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \dfrac{\Gamma(\alpha+N_1)\Gamma(\beta+N_0)}{\Gamma(\alpha+\beta+N)}$ |
| Predictive | $p(x_{\text{next}}=1 \mid D) = \dfrac{\alpha + N_1}{\alpha + \beta + N}$ |
| Beta mean | $\mathbb{E}[\mu] = \dfrac{\alpha}{\alpha + \beta}$ |
| Gamma (integers) | $\Gamma(n) = (n-1)!$ |
Key: $N_1$ = count of ones, $N_0$ = count of zeros, $N = N_1 + N_0$. No binomial coefficient in the likelihood!
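A minimal sketch of the full Beta-Bernoulli pipeline, assuming made-up counts ($N_1 = 7$, $N_0 = 3$) and a Beta(2, 2) prior:

```python
# Beta-Bernoulli posterior, evidence and predictive for a toy coin dataset.
from math import gamma

alpha, beta = 2, 2          # assumed prior Beta(2, 2)
N1, N0 = 7, 3               # assumed counts of ones and zeros
N = N1 + N0

# Posterior is Beta(alpha + N1, beta + N0)
post_a, post_b = alpha + N1, beta + N0

# Evidence p(D)
evidence = (gamma(alpha + beta) / (gamma(alpha) * gamma(beta))) * \
           (gamma(alpha + N1) * gamma(beta + N0) / gamma(alpha + beta + N))

# Predictive p(x_next = 1 | D) and posterior mean
predictive = (alpha + N1) / (alpha + beta + N)
post_mean = post_a / (post_a + post_b)

print(post_a, post_b, evidence, predictive, post_mean)
```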
Type 3: Gaussian Posterior & Evidence
Setup: Likelihood $\mathcal{N}(x \mid \mu, \sigma^2)$, Prior $\mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$
| Formula | Expression |
|---|---|
| Posterior variance | $\dfrac{1}{\sigma^2_{\text{post}}} = \dfrac{1}{\sigma^2_0} + \dfrac{1}{\sigma^2}$ |
| Posterior mean | $\mu_{\text{post}} = \sigma^2_{\text{post}} \left( \dfrac{\mu_0}{\sigma^2_0} + \dfrac{x}{\sigma^2} \right)$ |
| Evidence | $p(x) = \mathcal{N}(x \mid \mu_0,\; \sigma^2_0 + \sigma^2)$ |
Shortcut: Posterior precision = sum of precisions. Posterior mean = precision-weighted average. Evidence variance = prior variance + likelihood variance.
When both variances = 1: $\sigma^2_{\text{post}} = 0.5$, $\mu_{\text{post}} = (\mu_0 + x)/2$.
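A short sketch of the posterior and evidence computation, assuming the example values $\mu_0 = 0$, $\sigma_0^2 = \sigma^2 = 1$, $x = 2$ (which reproduces the shortcut above):

```python
# Gaussian posterior and evidence for a single observation x (example values).
mu0, var0 = 0.0, 1.0        # prior N(mu0, var0)
var = 1.0                   # likelihood noise variance
x = 2.0                     # observation

post_var = 1.0 / (1.0 / var0 + 1.0 / var)        # posterior precision = sum of precisions
post_mean = post_var * (mu0 / var0 + x / var)    # precision-weighted average

evid_mean, evid_var = mu0, var0 + var            # evidence N(x | mu0, var0 + var)

print(post_mean, post_var, evid_mean, evid_var)  # 1.0 0.5 0.0 2.0
```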
Type 4: Model Evidence & Bayesian Model Averaging
| Formula | Expression |
|---|---|
| Evidence | $p(x \mid m_k) = \int p(x \mid \theta, m_k) \cdot p(\theta \mid m_k) \, d\theta$ |
| Model averaging | $p(x) = \sum_k p(x \mid m_k) \cdot p(m_k)$ |
| Gaussian evidence | $p(x \mid m) = \mathcal{N}(x \mid \mu_0,\; \sigma^2_0 + \sigma^2)$ |
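A sketch of model averaging with Gaussian evidences, assuming two hypothetical models that differ only in their prior mean and a uniform model prior:

```python
# Bayesian model averaging over two models with Gaussian evidence.
from scipy.stats import norm

x = 1.5
models = [
    {"mu0": 0.0, "var0": 1.0, "var": 1.0, "prior": 0.5},  # m1 (assumed values)
    {"mu0": 3.0, "var0": 1.0, "var": 1.0, "prior": 0.5},  # m2 (assumed values)
]

# p(x) = sum_k p(x | m_k) p(m_k), with p(x | m_k) = N(x | mu0, var0 + var)
p_x = sum(norm.pdf(x, m["mu0"], (m["var0"] + m["var"]) ** 0.5) * m["prior"]
          for m in models)
print(p_x)
```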
Type 5: Model Comparison & Bayes Factor
| Formula | Expression |
|---|---|
| Bayes Factor | $B_{12} = \dfrac{p(D \mid m_1)}{p(D \mid m_2)}$ |
| Posterior ratio | $\dfrac{p(m_1 \mid D)}{p(m_2 \mid D)} = \dfrac{p(D \mid m_1)}{p(D \mid m_2)} \cdot \dfrac{p(m_1)}{p(m_2)} = B_{12} \cdot \dfrac{p(m_1)}{p(m_2)}$ |
| BF from posteriors | $B_{12} = \dfrac{p(D \mid m_1)}{p(D \mid m_2)} = \dfrac{p(m_1 \mid D)}{p(m_2 \mid D)} \cdot \dfrac{p(m_2)}{p(m_1)}$ |
| Posterior model prob | $p(m_k \mid D) = \dfrac{p(D \mid m_k) \cdot p(m_k)}{p(D)}$ |
Interpretation: Posterior ratio > 1 → model 1 wins. Evidence = ∫ likelihood × prior dθ.
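A sketch of the model-comparison bookkeeping, assuming hypothetical evidence values and a uniform model prior:

```python
# Bayes factor and posterior model probabilities from two evidences.
evid = {"m1": 0.12, "m2": 0.03}          # p(D | m_k), hypothetical values
prior = {"m1": 0.5, "m2": 0.5}           # p(m_k)

B12 = evid["m1"] / evid["m2"]            # Bayes factor
p_D = sum(evid[m] * prior[m] for m in evid)
post = {m: evid[m] * prior[m] / p_D for m in evid}

print(B12, post)                         # B12 = 4.0 -> model 1 favoured
```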
Type 6: Bayesian Classifier
| Formula | Expression |
|---|---|
| Posterior | $p(C_k \mid x) = \dfrac{p(x \mid C_k) \cdot p(C_k)}{p(x)}$ |
| Evidence | $p(x) = p(x \mid C_1) \cdot p(C_1) + p(x \mid C_2) \cdot p(C_2)$ |
| Decision boundary | $\dfrac{p(x \mid C_1) \cdot p(C_1)}{p(x \mid C_2) \cdot p(C_2)} = 1$ |
| Error probability | $P(\text{error}) = \int_{\text{decide }C_2} p(x \mid C_1)p(C_1)dx + \int_{\text{decide }C_1} p(x \mid C_2)p(C_2)dx$ |
Boundary shortcuts:
- Different covariances → quadratic (parabola)
- Shared covariance → linear (straight line)
Learned class prior (Beta): $p(C_1 \mid x_\bullet, D) \propto \int p(x_\bullet \mid C_1)\, p(C_1 \mid \theta)\, p(\theta \mid D)\, d\theta$
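A sketch of the classifier posterior and decision, assuming 1-D Gaussian class-conditionals $\mathcal{N}(0,1)$ and $\mathcal{N}(2,1)$ with priors 0.6 and 0.4 (all example values):

```python
# Two-class Bayesian classifier with 1-D Gaussian class-conditionals.
from scipy.stats import norm

x = 1.2
lik = {"C1": norm.pdf(x, 0.0, 1.0), "C2": norm.pdf(x, 2.0, 1.0)}  # p(x | C_k)
prior = {"C1": 0.6, "C2": 0.4}                                     # p(C_k)

p_x = sum(lik[c] * prior[c] for c in lik)                # evidence p(x)
post = {c: lik[c] * prior[c] / p_x for c in lik}         # class posteriors
decision = max(post, key=post.get)                       # pick the larger posterior

print(post, decision)
```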
Type 7: Ball/Box Conditional Probability
| Formula | Expression |
|---|---|
| Product Rule | $P(A, B) = P(A \mid B) \cdot P(B)$ |
| Total Probability | $P(A) = \sum_i P(A \mid B_i) \cdot P(B_i)$ |
| Bayes' Rule | $P(B_i \mid A) = \dfrac{P(A \mid B_i) \cdot P(B_i)}{P(A)}$ |
Without replacement: Update counts after each draw (decrease both the drawn type and total by 1).
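A sketch of the total-probability-plus-Bayes pattern, assuming a made-up two-box example (3 red / 1 blue vs. 1 red / 3 blue, boxes equally likely):

```python
# Two boxes, draw one ball, infer which box it came from.
boxes = {"B1": {"red": 3, "blue": 1}, "B2": {"red": 1, "blue": 3}}  # assumed counts
p_box = {"B1": 0.5, "B2": 0.5}                                       # P(B_i)

def p_colour_given_box(colour, box):
    counts = boxes[box]
    return counts[colour] / sum(counts.values())

# Total probability, then Bayes' rule
p_red = sum(p_colour_given_box("red", b) * p_box[b] for b in boxes)
post = {b: p_colour_given_box("red", b) * p_box[b] / p_red for b in boxes}

print(p_red, post)   # p_red = 0.5, post = {'B1': 0.75, 'B2': 0.25}
```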
Type 8: Gaussian Mixture Model (GMM) Form
| Formula | Expression |
|---|---|
| Joint (one-hot $z_n$) | $p(x_n, z_n) = \prod_{k=1}^K \left(\pi_k \cdot \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right)^{z_{nk}}$ |
| Marginal | $p(x_n) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ |
| Mixing constraint | $\sum_{k=1}^K \pi_k = 1$ |
Spotting trick: Both $\pi_k$ AND $\mathcal{N}$ must be inside the parentheses raised to $z_{nk}$ (not $z_n$). Joint uses product ∏, marginal uses sum Σ.
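A sketch of evaluating the marginal $p(x_n)$ for a two-component 1-D mixture, with all parameter values chosen as arbitrary examples:

```python
# GMM marginal likelihood for a single 1-D point.
from scipy.stats import norm

x = 0.5
pi = [0.3, 0.7]          # mixing weights, sum to 1
mu = [-1.0, 2.0]         # component means
sigma = [1.0, 1.0]       # component standard deviations

p_x = sum(pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2))
print(p_x)
```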
Type 9: Variational Free Energy (VFE)
| Formula | Expression |
|---|---|
| VFE functional | $F[q] = \int q(z) \log \dfrac{q(z)}{p(x,z)} \, dz$ |
| Upper bound | $F[q] \geq -\log p(x)$ for any $q(z)$ |
| Equality | $F[q] = -\log p(x)$ when $q(z) = p(z \mid x)$ |
Key: minimizing the VFE is equivalent to minimizing $KL(q(z) \,\|\, p(z \mid x))$, and the VFE itself is an upper bound on the negative log evidence.
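A sketch verifying the bound numerically for a toy discrete latent with two states; the joint values $p(x, z)$ are arbitrary assumptions:

```python
# VFE for a discrete latent z: F[q] >= -log p(x), with equality at the true posterior.
import numpy as np

p_xz = np.array([0.3, 0.1])            # assumed p(x, z=1), p(x, z=2) for the observed x
p_x = p_xz.sum()                       # evidence p(x)
true_post = p_xz / p_x                 # p(z | x)

def free_energy(q):
    q = np.asarray(q)
    return float(np.sum(q * np.log(q / p_xz)))   # F[q] = sum_z q log(q / p(x,z))

print(free_energy([0.5, 0.5]), -np.log(p_x))     # F[q] >= -log p(x)
print(free_energy(true_post), -np.log(p_x))      # equal at q = p(z|x)
```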
Type 10: Free Energy Principle (FEP) — Concepts
| Principle | Statement |
|---|---|
| Generative model | Agents MUST have an internal model of how sensory data is generated |
| Perception | Minimizes free energy (update beliefs to match observations) |
| Action | Minimizes expected free energy of future states |
| Goals | Encoded as target priors in the generative model |
| Decision making | Minimization of a functional of beliefs about future states |
Rules of thumb: always MINIMIZE (never maximize) free energy. The minimized quantity is a functional of beliefs, not a cost function over states. Goals enter as desired (target) priors over future states, not as predictions of the actual future.
Type 11: Factor Analysis & Marginal Gaussian
| Formula | Expression |
|---|---|
| Model | $x = Wz + \epsilon$, $z \sim \mathcal{N}(0, I)$, $\epsilon \sim \mathcal{N}(0, \Psi)$ |
| Conditional | $p(x \mid z) = \mathcal{N}(x \mid Wz, \Psi)$ |
| Joint | $p(x, z) = \mathcal{N}(x \mid Wz, \Psi) \cdot \mathcal{N}(z \mid 0, I)$ |
| Marginal | $p(x) = \mathcal{N}(x \mid 0,\; WW^T + \Psi)$ |
Note: $WW^T$ (not $W^TW$) — must give $N \times N$ covariance for $x$.
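A sketch checking $\operatorname{cov}(x) = WW^T + \Psi$ by sampling, with $W$ and $\Psi$ chosen as arbitrary example values:

```python
# Factor analysis marginal covariance: check cov(x) = W W^T + Psi by sampling.
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]])    # 3 x 2 loading matrix (assumed)
Psi = np.diag([0.1, 0.2, 0.3])                        # diagonal noise covariance (assumed)

z = rng.standard_normal((100_000, 2))                 # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(3), Psi, size=100_000)
x = z @ W.T + eps                                     # x = W z + eps

print(W @ W.T + Psi)             # analytic marginal covariance
print(np.cov(x, rowvar=False))   # empirical estimate, close to the line above
```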
Type 12: Recursive Filtering / Kalman
| Formula | Expression |
|---|---|
| Observation model | $x_t = \theta + \epsilon_t$, $\epsilon_t \sim \mathcal{N}(0, \sigma_\epsilon^2)$ |
| Likelihood | $p(x_k \mid \theta) = \mathcal{N}(x_k \mid \theta, \sigma_\epsilon^2)$ |
| Recursive update | $p(\theta \mid D_k) \propto p(x_k \mid \theta) \cdot p(\theta \mid D_{k-1})$ |
| Kalman gain | $K_k = \dfrac{\sigma_{k-1}^2}{\sigma_{k-1}^2 + \sigma_\epsilon^2}$ |
| Mean update | $\mu_k = \mu_{k-1} + K_k \, (x_k - \mu_{k-1})$ |
| Variance update | $\sigma_k^2 = (1 - K_k) \, \sigma_{k-1}^2$ |
State-space update: $p(z_t \mid x_{1:t}) \propto p(x_t \mid z_t) \sum_{z_{t-1}} p(z_t \mid z_{t-1})\, p(z_{t-1} \mid x_{1:t-1})$
As $k \to \infty$: $\sigma_k^2 \to 0$, $K_k \to 0$, $\mu_k \to$ true value (stationary).
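A sketch of the scalar recursion, assuming a made-up true value $\theta = 1.5$ and noise variance $\sigma_\epsilon^2 = 0.5$:

```python
# Scalar recursive filter for a fixed unknown theta observed with Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)
theta_true, var_eps = 1.5, 0.5       # assumed true value and noise variance
mu, var = 0.0, 1.0                   # prior mean and variance

for _ in range(50):
    x = theta_true + rng.normal(0.0, var_eps ** 0.5)   # new observation x_k
    K = var / (var + var_eps)                          # Kalman gain
    mu = mu + K * (x - mu)                             # mean update
    var = (1 - K) * var                                # variance update

print(mu, var)   # mu close to theta_true, var shrinking toward 0
```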
Type 13: Log-Likelihood & MLE
| Formula | Expression |
|---|---|
| Log-likelihood | $\log p(D \mid \theta) = \sum_n \sum_k y_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \sum_n \sum_k y_{nk} \log \pi_k$ |
| MLE mean | $\hat{\mu}_k = \dfrac{\sum_n y_{nk}\, x_n}{N_k}$ where $N_k = \sum_n y_{nk}$ |
| MLE covariance | $\hat{\Sigma}_k = \dfrac{1}{N_k} \sum_n y_{nk}\, (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T$ |
| Log-Gaussian | $\log \mathcal{N}(x \mid \mu, \Sigma) = -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) - \frac{1}{2}\log\lvert\Sigma\rvert + \text{const}$ |
Covariance trick: Must use outer product $(x-\mu)(x-\mu)^T$ (gives matrix), NOT inner product $(x-\mu)^T(x-\mu)$ (gives scalar).
Decision boundary: Different covariances → quadratic (parabola). Shared covariance → linear (straight line).
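A sketch of the per-class MLE formulas on a tiny made-up dataset with one-hot labels $y_{nk}$:

```python
# MLE mean and covariance per class with one-hot labels y (N x K).
import numpy as np

X = np.array([[0.0, 0.1], [0.2, -0.1], [2.0, 2.1], [1.8, 1.9]])   # toy 2-D points
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)       # one-hot labels

N_k = y.sum(axis=0)                       # class counts N_k = sum_n y_nk
mu = (y.T @ X) / N_k[:, None]             # MLE means, one row per class

Sigma = []
for k in range(y.shape[1]):
    d = X - mu[k]                         # deviations from the class mean
    # outer products weighted by y_nk, normalised by N_k (not N)
    Sigma.append((y[:, k, None] * d).T @ d / N_k[k])

print(mu)
print(Sigma)
```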
Type 14: GMM, VFE, FEP & Concept Questions
VFE Properties
- $F[q] \geq -\log p(x)$ — upper bound on negative log evidence
- Equality at $q(z) = p(z \mid x)$ (true posterior, NOT the prior)
- Minimizing VFE = minimizing KL divergence to true posterior
Bayesian vs MLE
- Likelihood is a function of parameters: $L(\theta) = p(D \mid \theta)$
- MLE maximizes $p(D \mid \theta)$; MAP maximizes $p(D \mid \theta)p(\theta)$ — equal only with uniform prior
- As data grows: likelihood narrows, prior stays fixed → MLE ≈ MAP
- Bayesian evidence = fit minus complexity (built-in overfitting protection)
Gaussian Properties
- Linear combinations of Gaussian random variables → Gaussian
- Products/ratios of Gaussian random variables → NOT Gaussian
Discriminative Predictive
Generative Model for Signal Recovery
Quick-Reference Concept Facts
- Bayesian methods are NOT faster than MLE — often more computationally expensive
- No train/test split needed in Bayesian approach — all data used for inference
- Beta is the conjugate prior for Bernoulli (not Gaussian)
- MLE covariance uses $1/N_k$ (not $1/N$), and requires $y_{nk}$ selector
Gamma Factorial Quick Values
| $n$ | $\Gamma(n) = (n-1)!$ |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 6 |
| 5 | 24 |
| 6 | 120 |
| 7 | 720 |
| 8 | 5040 |
| 9 | 40320 |
| 10 | 362880 |